Random Forest

Roadmap

1. Embedding Numerous Features: Kernel Models
2. Combining Predictive Features: Aggregation Models

   Lecture 9: Decision Tree
   recursive branching (purification) for conditional
   aggregation of constant hypotheses

   Lecture 10: Random Forest
   1. Random Forest Algorithm
   2. Out-Of-Bag Estimate
   3. Feature Selection
   4. Random Forest in Action

3. Distilling Implicit Features: Extraction Models

Random Forest Random Forest Algorithm

Recall: Bagging and Decision Tree

Bagging
function Bag(D, A)
  For t = 1, 2, ..., T
    1. request size-N' data $\tilde{D}_t$ by bootstrapping with D
    2. obtain base $g_t$ by A($\tilde{D}_t$)
  return G = Uniform({$g_t$})
- reduces variance by voting/averaging

Decision Tree
function DTree(D)
  if termination return base $g_t$
  else
    1. learn b(x) and split D to $D_c$ by b(x)
    2. build $G_c$ ← DTree($D_c$)
    3. return $G(x) = \sum_{c=1}^{C} [\![ b(x) = c ]\!] \, G_c(x)$
- large variance, especially if fully-grown

putting them together?
(i.e. aggregation of aggregation :-) )

Random Forest (RF)

random forest (RF) = bagging + fully-grown C&RT decision tree

function RandomForest(D)
  For t = 1, 2, ..., T
    1. request size-N' data $\tilde{D}_t$ by bootstrapping with D
    2. obtain tree $g_t$ by DTree($\tilde{D}_t$)
  return G = Uniform({$g_t$})

function DTree(D)
  if termination return base $g_t$
  else
    1. learn b(x) and split D to $D_c$ by b(x)
    2. build $G_c$ ← DTree($D_c$)
    3. return $G(x) = \sum_{c=1}^{C} [\![ b(x) = c ]\!] \, G_c(x)$

• highly parallel/efficient to learn
• inherit pros of C&RT
• eliminate cons of fully-grown tree
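
The RandomForest(D) and DTree(D) pseudocode can be sketched in plain Python as follows. This is a minimal illustration rather than the original RF implementation: it assumes N' = N, binary axis-parallel splits chosen by Gini impurity, labels that are not tuples, and majority voting for Uniform; the helper names (`best_split`, `dtree`, `random_forest`) are made up for this sketch.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y):
    """Return (feature index, threshold) minimizing the weighted Gini impurity,
    or None when no feature separates the rows."""
    best = None
    for i in range(len(X[0])):
        values = sorted(set(row[i] for row in X))
        for lo, hi in zip(values, values[1:]):
            theta = (lo + hi) / 2
            left = [t for row, t in zip(X, y) if row[i] <= theta]
            right = [t for row, t in zip(X, y) if row[i] > theta]
            if not left or not right:
                continue  # guard against a degenerate midpoint
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, i, theta)
    return None if best is None else best[1:]


def dtree(X, y):
    """Fully grown tree: recurse until the node is pure or cannot be split.
    A leaf is a label (constant hypothesis); an internal node is a tuple."""
    split = None if len(set(y)) == 1 else best_split(X, y)
    if split is None:
        return Counter(y).most_common(1)[0][0]
    i, theta = split
    left = [n for n, row in enumerate(X) if row[i] <= theta]
    right = [n for n, row in enumerate(X) if row[i] > theta]
    return (i, theta,
            dtree([X[n] for n in left], [y[n] for n in left]),
            dtree([X[n] for n in right], [y[n] for n in right]))


def tree_predict(tree, x):
    """Follow the branching functions down to a leaf."""
    while isinstance(tree, tuple):
        i, theta, left, right = tree
        tree = left if x[i] <= theta else right
    return tree


def random_forest(X, y, T, rng):
    """Bagging: each tree is trained on a size-N bootstrap sample (N' = N assumed)."""
    N = len(X)
    trees = []
    for _ in range(T):
        idx = [rng.randrange(N) for _ in range(N)]
        trees.append(dtree([X[n] for n in idx], [y[n] for n in idx]))
    return trees


def forest_predict(trees, x):
    """Uniform aggregation: majority vote over the trees."""
    return Counter(tree_predict(t, x) for t in trees).most_common(1)[0][0]


# Toy data: two well-separated clusters.
X = [[-1, -2], [-2, -1], [-3, -3], [-4, -2], [-2, -4],
     [1, 2], [2, 1], [3, 3], [4, 2], [2, 4]]
y = [-1] * 5 + [1] * 5
forest = random_forest(X, y, T=15, rng=random.Random(0))
print(forest_predict(forest, [10, 10]), forest_predict(forest, [-10, -10]))
```

Each tree depends only on its own bootstrap sample, so the loop in `random_forest` could run in parallel, which is the 'highly parallel' point above.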

Diversifying by Feature Projection

recall: data randomness for diversity in bagging
  randomly sample N' examples from D

another possibility for diversity:
  randomly sample d' features from x

• when sampling index $i_1, i_2, \ldots, i_{d'}$: $\Phi(x) = (x_{i_1}, x_{i_2}, \ldots, x_{i_{d'}})$
• $Z \subseteq R^{d'}$: a random subspace of $X \subseteq R^{d}$
• often $d' \ll d$, efficient for large d
  - can be generally applied on other models
• original RF re-samples a new subspace for each b(x) in C&RT

RF = bagging + random-subspace C&RT
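
As a small illustration of the random-subspace transform $\Phi(x)$, the sketch below draws d' distinct feature indices and keeps only those coordinates. The function names are hypothetical, and sampling without replacement is an assumption of this sketch.

```python
import random


def sample_feature_indices(d, d_prime, rng):
    """Draw d' distinct feature indices i_1 < ... < i_{d'} from range(d)."""
    return sorted(rng.sample(range(d), d_prime))


def project(x, indices):
    """Phi(x) = (x_{i_1}, ..., x_{i_{d'}}): keep only the sampled features."""
    return [x[i] for i in indices]


rng = random.Random(0)
x = [0.5, -1.2, 3.0, 7.7, 0.0, 2.4]               # d = 6
indices = sample_feature_indices(len(x), 3, rng)  # d' = 3
z = project(x, indices)
print(indices, z)
```

In the RF setting, `sample_feature_indices` would be called again for every branching function b(x), not once per tree.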

Diversifying by Feature Expansion

randomly sample d' features from x: $\Phi(x) = P \cdot x$
with row i of P sampled randomly ∈ natural basis

more powerful features for diversity: row i other than natural basis

• projection (combination) with random row $p_i$ of P: $\phi_i(x) = p_i^T x$
• often consider low-dimensional projection:
  only d'' non-zero components in $p_i$
• includes random subspace as special case:
  d'' = 1 and $p_i$ ∈ natural basis
• original RF considers d' random low-dimensional projections for
  each b(x) in C&RT

RF = bagging + random-combination C&RT
- randomness everywhere!
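
The random-combination idea can be sketched as below: each row $p_i$ of P has d'' non-zero entries, $\Phi(x) = P x$, and a branching function thresholds one such combination. Drawing the weights uniformly from [-1, 1] and the function names are assumptions made for illustration; the slides do not fix a weight distribution.

```python
import random


def random_combination_matrix(d, d_prime, d_dprime, rng):
    """Build P with d' rows; each row p_i has d'' randomly placed weights.
    Uniform weights in [-1, 1] are an assumption of this sketch."""
    P = []
    for _ in range(d_prime):
        row = [0.0] * d
        for j in rng.sample(range(d), d_dprime):
            row[j] = rng.uniform(-1.0, 1.0)
        P.append(row)
    return P


def expand(P, x):
    """Phi(x) = P x, i.e. phi_i(x) = p_i^T x for each row p_i."""
    return [sum(p * v for p, v in zip(row, x)) for row in P]


def branch(p, theta, x):
    """b(x): threshold the linear combination p^T x at theta."""
    return 1 if sum(pj * xj for pj, xj in zip(p, x)) > theta else -1


x = [0.5, -1.2, 3.0, 7.7]
P = random_combination_matrix(d=4, d_prime=2, d_dprime=3, rng=random.Random(0))
features = expand(P, x)

# Special case: d'' = 1 with natural-basis rows just selects features.
P_subspace = [[0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
print(expand(P_subspace, x))  # [3.0, 0.5]
```

Since `branch` is a sign of a linear function of x, each b(x) built this way has the form of a perceptron.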

Fun Time

Within RF that contains random-combination C&RT trees, which of the
following hypotheses is equivalent to each branching function b(x)
within the tree?

1. a constant
2. a decision stump
3. a perceptron
4. none of the other choices

Reference Answer: (3)

In each b(x), the input vector x is first
projected by a random vector v and then
thresholded to make a binary decision, which
is exactly what a perceptron does.

Random Forest Out-Of-Bag Estimate

Bagging Revisited

Bagging
function Bag(D, A)
  For t = 1, 2, ..., T
    1. request size-N' data $\tilde{D}_t$ by bootstrapping with D
    2. obtain base $g_t$ by A($\tilde{D}_t$)
  return G = Uniform({$g_t$})

[Table: one row per example $(x_n, y_n)$, one column per $g_t$; a cell holds $\tilde{D}_t$ if the example was sampled into $\tilde{D}_t$, and ⋆ otherwise]

⋆ in t-th column: not used for obtaining $g_t$
- called out-of-bag (OOB) examples of $g_t$

Hsuan-Tien Lin (NTU CSIE), Machine Learning Techniques

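Number of OOB Examples

OOB (in ⋆) ⟺ not sampled after N' drawings

with N' = N:
• probability for $(x_n, y_n)$ to be OOB for $g_t$: $\left(1 - \frac{1}{N}\right)^{N}$
• if N large:
  $\left(1 - \frac{1}{N}\right)^{N} = \frac{1}{\left(\frac{N}{N-1}\right)^{N}} = \frac{1}{\left(1 + \frac{1}{N-1}\right)^{N}} \approx \frac{1}{e}$

OOB size per $g_t$ ≈ $\frac{1}{e} N$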
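
A quick simulation can make the 1/e figure concrete. The sketch below bootstraps N' = N indices once and measures the fraction of examples that were never drawn; the value N = 2000 and the function name are arbitrary choices for illustration.

```python
import random


def oob_fraction(N, rng):
    """Draw N' = N indices with replacement; return the fraction never drawn."""
    drawn = {rng.randrange(N) for _ in range(N)}
    return (N - len(drawn)) / N


N = 2000
simulated = oob_fraction(N, random.Random(0))
exact = (1 - 1 / N) ** N   # probability that one given example is OOB
print(simulated, exact)
```

Both numbers should land near 1/e ≈ 0.368, so roughly a third of the data is left out of each bootstrap sample.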

OOB versus Validation

[Tables comparing which examples each hypothesis leaves unused: OOB (⋆ entries per $g_t$) versus validation ($D_{val}$)]

• ⋆ like $D_{val}$: 'enough' random examples unused during training
• use ⋆ to validate $g_t$? easy, but rarely needed
• use ⋆ to validate G? $E_{oob}(G) = \frac{1}{N} \sum_{n=1}^{N} err(y_n, G_n^-(x_n))$,
  with $G_n^-$ containing only the trees that $x_n$ is OOB of,
  such as $G_N^-(x) = average(g_2, g_3, g_T)$

$E_{oob}$: self-validation of bagging/RF
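
The definition of $E_{oob}(G)$ can be sketched as follows, with 0/1 error and majority voting standing in for err and average. The function name and the hand-built toy bags are hypothetical, and skipping examples that are in every bag is an assumption of this sketch (the slide formula averages over all N).

```python
from collections import Counter


def oob_error(hypotheses, bags, X, y):
    """E_oob(G): average 0/1 error of G_n^-(x_n).

    hypotheses[t] is a callable g_t(x); bags[t] is the set of example
    indices that were drawn when training g_t.
    """
    mistakes, counted = 0, 0
    for n, (x_n, y_n) in enumerate(zip(X, y)):
        # G_n^-: only the g_t for which (x_n, y_n) is out-of-bag
        votes = [g(x_n) for g, bag in zip(hypotheses, bags) if n not in bag]
        if not votes:
            continue  # assumption: skip examples that appear in every bag
        counted += 1
        mistakes += Counter(votes).most_common(1)[0][0] != y_n
    return mistakes / counted if counted else float("nan")


# Hand-built toy setting: constant hypotheses with made-up bags.
X = [[0], [1], [2]]
y = [1, -1, 1]
always_pos = lambda x: 1
always_neg = lambda x: -1
hypotheses = [always_pos, always_neg, always_neg, always_neg]
bags = [{1}, {0, 2}, {0, 1}, {0, 1}]
err = oob_error(hypotheses, bags, X, y)
print(err)
```

Here examples 0 and 1 are judged correctly by their OOB hypotheses, while example 2 is outvoted two to one, so the estimate is 1/3. No hypothesis is retrained to obtain it.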

Model Selection by OOB Error

Previously: by Best $E_{val}$
  $g_{m^*} = A_{m^*}(D)$
  $m^* = \arg\min_{1 \le m \le M} E_m$
  $E_m = E_{val}(A_m(D_{train}))$

RF: by Best $E_{oob}$
  $G_{m^*} = RF_{m^*}(D)$
  $m^* = \arg\min_{1 \le m \le M} E_m$
  $E_m = E_{oob}(RF_m(D))$

• use $E_{oob}$ for self-validation
  - of RF parameters such as d''
• no re-training needed

$E_{oob}$ often accurate in practice

Fun Time

For a data set with N = 1126, what is the probability that $(x_{1126}, y_{1126})$
is not sampled after bootstrapping N' = N samples from the data set?

1. 0.113
2. 0.368
3. 0.632
4. 0.887

Reference Answer: (2)

The value of $\left(1 - \frac{1}{N}\right)^{N}$ with N = 1126 is about
0.367716, which is close to $\frac{1}{e} = 0.367879$.
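
The two numbers in the reference answer can be checked directly; this is only an arithmetic check, with variable names chosen for illustration.

```python
import math

N = 1126
p_oob = (1 - 1 / N) ** N   # probability of never being drawn in N draws
limit = 1 / math.e         # large-N limit
print(round(p_oob, 6), round(limit, 6))
```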

Random Forest Feature Selection

Feature Selection

for $x = (x_1, x_2, \ldots, x_d)$, want to remove
• redundant features: like keeping one of 'age' and 'full birthday'
• irrelevant features: like insurance type for cancer prediction

and only 'learn' subset-transform $\Phi(x) = (x_{i_1}, x_{i_2}, \ldots, x_{i_{d'}})$
with $d' < d$ for $g(\Phi(x))$

advantages:
• efficiency: simpler hypothesis and shorter prediction time
• generalization: 'feature noise' removed
• interpretability

disadvantages:
• computation: 'combinatorial' optimization in training
• overfit: 'combinatorial' selection
• mis-interpretability

decision tree: a rare model
with built-in feature selection

Feature Selection by Importance

idea: if possible to calculate
  importance(i) for i = 1, 2, ..., d
then can select $i_1, i_2, \ldots, i_{d'}$ of top-d' importance

importance by linear model

  $score = w^T x = \sum_{i=1}^{d} w_i x_i$

• intuitive estimate: importance(i) = $|w_i|$ with some 'good' w
• getting 'good' w: learned from data
• non-linear models? often much harder

next: 'easy' feature selection in RF
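
For the linear case, selecting the top-d' features by $|w_i|$ takes only a few lines. The weight vector below is made up for illustration, and the function name is hypothetical; in practice w would be learned from data.

```python
def top_features_by_weight(w, d_prime):
    """Rank features by importance(i) = |w_i| and return the top-d' indices."""
    importance = [abs(w_i) for w_i in w]
    ranked = sorted(range(len(w)), key=lambda i: importance[i], reverse=True)
    return ranked[:d_prime]


w = [0.1, -2.0, 0.5, 0.0]   # hypothetical learned weights
selected = top_features_by_weight(w, 2)
print(selected)  # [1, 2]
```

Note that $|w_i|$ is only comparable across features when the features are on similar scales, which is one reason a 'good' w matters.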

Feature Importance by Permutation Test

idea: random test
- if feature i is needed, 'random' values of $x_{n,i}$ degrade performance

• which random values?
  • uniform, Gaussian, ...: $P(x_i)$ changed
  • bootstrap, permutation (of $\{x_{n,i}\}_{n=1}^{N}$): $P(x_i)$ approximately preserved
• permutation test:

  importance(i) = performance(D) - performance($D^{(p)}$)

  where $D^{(p)}$ is D with $\{x_{n,i}\}$ replaced by permuted $\{x_{n,i}\}_{n=1}^{N}$

permutation test: a general statistical tool for
arbitrary non-linear models like RF
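
A minimal sketch of the permutation test is given below. It makes one simplifying assumption that should be stated loudly: instead of re-training on $D^{(p)}$, it re-evaluates a fixed, already-trained hypothesis on the permuted data, with accuracy as the performance measure. The hypothesis `predict` and the toy data are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Fraction of examples that the hypothesis labels correctly."""
    return sum(predict(x) == t for x, t in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, i, rng):
    """importance(i) = performance(D) - performance(D^(p)),
    where D^(p) has column i shuffled across the examples."""
    column = [row[i] for row in X]
    rng.shuffle(column)
    X_perm = [row[:i] + [v] + row[i + 1:] for row, v in zip(X, column)]
    return accuracy(predict, X, y) - accuracy(predict, X_perm, y)


X = [[-2.0, 5566.0], [-1.0, 5566.0], [1.0, 5566.0], [2.0, 5566.0]]
y = [-1, -1, 1, 1]
predict = lambda x: 1 if x[0] > 0 else -1   # hypothetical fixed hypothesis
rng = random.Random(0)
imp_0 = permutation_importance(predict, X, y, 0, rng)
imp_1 = permutation_importance(predict, X, y, 1, rng)   # constant feature
print(imp_0, imp_1)
```

Permuting the constant second feature leaves the data unchanged, so its importance is exactly 0, while the first feature can only lose accuracy when shuffled.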

Feature Importance in Original Random Forest

permutation test:

  importance(i) = performance(D) - performance($D^{(p)}$)

  where $D^{(p)}$ is D with $\{x_{n,i}\}$ replaced by permuted $\{x_{n,i}\}_{n=1}^{N}$

• performance($D^{(p)}$): needs re-training and validation in general
• 'escaping' validation? OOB in RF
• original RF solution: importance(i) = $E_{oob}(G) - E_{oob}^{(p)}(G)$,
  where $E_{oob}^{(p)}$ comes from replacing each request of $x_{n,i}$ by a
  permuted OOB value

RF feature selection via permutation + OOB:
often efficient and promising in practice

Fun Time

For RF, if the 1126-th feature within the data set is a constant 5566,
what would importance(i) be?

1. 0
2. 1
3. 1126
4. 5566

Reference Answer: (1)

When a feature is a constant, permutation
does not change its value. Then, $E_{oob}(G)$ and
$E_{oob}^{(p)}(G)$ are the same, and thus
importance(i) = 0.

Random Forest Random Forest in Action

A Simple Data Set

[Figure panels: $g_{C\&RT}$ with random combination; $g_t$ (N' = N/2); G with first t trees, shown for increasing t up to t = 1000]

'smooth' and large-margin-like boundary
with many trees

A Complicated Data Set

[Figure panels: $g_t$ (N' = N/2); G with first t trees, shown for increasing t]

'easy yet robust' nonlinear model

A Complicated and Noisy Data Set

[Figure panels: $g_t$ (N' = N/2); G with first t trees, shown for increasing t]

noise corrected by voting

How Many Trees Needed?

almost every theory: the more, the 'better'
assuming good $\bar{g} = \lim_{T \to \infty} G$

Our NTU Experience

• KDDCup 2013 Track 1 (yes, NTU is world champion again! :-)):
  predicting author-paper relation
• $E_{val}$ of thousands of trees: [0.015, 0.019] depending on seed;
  $E_{out}$ of top 20 teams: [0.014, 0.019]
• decision: take 12000 trees with seed 1

cons of RF: may need lots of trees if the
whole random process too unstable
- should double-check stability of G
to ensure enough trees

Fun Time

Which of the following is not the best use of Random Forest?

1. train each tree with bootstrapped data
2. use $E_{oob}$ to validate the performance
3. conduct feature selection with permutation test
4. fix the number of trees, T, to the lucky number 1126

Reference Answer: (4)

A good value of T can depend on the nature of
the data and the stability of the whole random
process.

Summary

1. Embedding Numerous Features: Kernel Models
2. Combining Predictive Features: Aggregation Models

   Lecture 10: Random Forest
   1. Random Forest Algorithm
      bag of trees on randomly projected subspaces
   2. Out-Of-Bag Estimate
      self-validation with OOB examples
   3. Feature Selection
      permutation test for feature importance
   4. Random Forest in Action
      'smooth' boundary with many trees

   • next: boosted decision trees beyond classification

3. Distilling Implicit Features: Extraction Models